LOQUS: Linked Open Data SPARQL Querying System

Authors

  • Prateek Jain
  • Kunal Verma
  • Peter Z. Yeh
  • Pascal Hitzler
  • Amit P. Sheth
Abstract

The LOD cloud is gathering considerable momentum, with the number of contributors growing manifold. Many prominent data providers have submitted their data and linked it to other datasets with the help of manual mappings. The potential of the LOD cloud is enormous, ranging from challenging AI issues such as open domain question answering to automated knowledge discovery. We believe that there is not enough technology support available to effectively query the LOD cloud. To this end, we present a system called Linked Open Data SPARQL Querying System (LOQUS), which automatically maps user queries written in terms of a conceptual upper ontology to different datasets, creates a query plan, sends sub-queries to the different datasets, merges the results, and presents them to the user. We present a qualitative evaluation of the system based on real-world queries posed by other researchers and practitioners.

Introduction

The Linked Open Data (LOD) methodology has recently emerged as a powerful way of linking together disparate data sources (Bizer, Heath, and Berners-Lee 2009). Using this methodology, researchers have interlinked data from diverse areas such as life sciences, nature, geography, and entertainment. Moreover, many prominent data sources (e.g., Wikipedia [1], PubMed [2], and data.gov [3]) have also adopted this methodology to interlink their data. The result is the LOD cloud [4], a large and growing collection of interlinked public datasets represented using RDF and OWL. Concepts (and instances) in a dataset are connected to (and hence can be reached from) related concepts (and instances) in other datasets through semantic relationships such as owl:sameAs. Hence, the LOD cloud is becoming the largest currently available structured knowledge base. It has potential for applicability in many AI-related tasks such as open domain question answering, knowledge discovery, and the Semantic Web.
Copyright © 2010, The authors. All rights reserved.
[1] http://en.wikipedia.org/wiki/Main_Page
[2] http://www.ncbi.nlm.nih.gov/pubmed/
[3] http://data.gov
[4] http://linkeddata.org/

An important prerequisite before the LOD cloud can enable these goals is allowing its users (and applications) to effectively pose queries to and retrieve answers from it. This prerequisite, however, is still an open problem for the LOD cloud. For example, consider answering the following query from Jamendo [5] using the LOD cloud:

Select artists within Jamendo who made at least one album tagged as 'punk' by a Jamendo user, sorted by the number of inhabitants of the places they are based near.

This query requires the user to select the relevant datasets, identify the concepts in these datasets that the query maps to, and merge the results from each dataset into a complete answer. These steps are very costly in terms of time and required expertise, which makes them infeasible given the size (and continued growth) of the LOD cloud. Apart from the sheer size, issues such as schema heterogeneity and entity disambiguation identified in (Jain et al. 2010) present profound challenges with respect to querying the LOD cloud.

In this paper, we present the Linked Open Data SPARQL Querying System (LOQUS), which allows users to effectively pose queries to the LOD cloud without having to know the exact structure of, and links between, its many datasets. LOQUS automatically maps the user's query to the relevant datasets (and concepts) using an upper level ontology, then executes the resulting query, and finally merges the results into a single, complete answer. We perform a qualitative evaluation of LOQUS on several real-world queries and demonstrate that LOQUS allows users to effectively execute queries over the LOD cloud without a deep understanding of its datasets. We also compare LOQUS with existing query systems for the LOD cloud to highlight the pros and cons of each approach.

The rest of the paper is organized as follows.
We begin by providing the motivation behind our work. We then introduce our approach, followed by an end-to-end example and evaluation. We conclude with related work, conclusions, and future work.

[5] http://dbtune.org/jamendo/
[6] http://www.w3.org/TR/rdf-sparql-query/

Motivation

SPARQL [6] has emerged as the de facto query language for the Semantic Web community. It provides a mechanism to express constraints and facts, and the entities matching those constraints are returned to the user. However, the syntax of SPARQL requires users to specify the precise details of the structure of the graph being queried in the triple pattern. To ease querying from an infrastructural perspective, data contributors have provided public SPARQL endpoints to query the LOD cloud datasets. But with respect to systematic querying of the LOD cloud, we believe that the following challenges, identified previously in (Jain et al. 2010), make the process difficult and should be addressed.

• Intimate knowledge of datasets: To formulate a query which spans multiple datasets (such as the one mentioned in the introduction), the user has to be familiar with multiple datasets. The user also has to express the precise relationships between concepts in the RDF triple pattern, which even in trivial scenarios implies browsing at least two to three datasets.
• Schema heterogeneity: The LOD cloud datasets cater to different domains, and thus require different modeling schemes. For example, a user interested in music-related information has to skim through at least three different music-related datasets, such as Jamendo, MusicBrainz, and MySpace. Even though the datasets belong to the same domain, each has been modelled differently depending on its creator. This is perfectly fine from a knowledge engineering perspective, but it makes querying the cloud difficult, as it requires users to understand the various heterogeneous schemas. This issue stems from the Lack of Conceptual Description of the LOD datasets.
• Entity disambiguation: Often the LOD cloud datasets have overlapping domains and tend to provide information about the same entity. For example, both DBpedia and Geonames have information about the city of Barcelona. Although Geonames references DBpedia using the owl:sameAs property, the user can still be confused as to which source is best suited to answer the query. This problem is compounded when contradictory facts are reported for the same entity by different datasets. For example, DBpedia quotes the population of Barcelona as 1,615,908, whereas according to Geonames it is 1,581,595. One can argue this might be because of a difference in the notion of the city of Barcelona. But that leads to another interesting question: Is the owl:sameAs property misused in the LOD cloud?
• Ranking of results: In scenarios where the results of the query can be computed and returned by multiple datasets, which result should be ranked higher for a specific query becomes an interesting and important question. As presented above, the query related to the population of Barcelona can be answered by multiple datasets, but which one of them is more relevant in a specific scenario? This issue has been addressed from the perspective of popularity of datasets by considering the cardinalities and types of the relationships in (Toupikov et al. 2009), but not from the perspective of requirements with regard to a specific query.

Our Approach

From a bird's-eye perspective, LOQUS accepts SPARQL queries serialized by the user using concepts from an upper level ontology. LOQUS identifies the datasets, and the corresponding queries to be executed on these datasets, primarily by using the mappings from the upper level ontology to these LOD cloud datasets. This section introduces the architecture of our querying system, the approach used for query execution, the utilization of mappings for sub-query construction, and the technique used for processing the results.
Figure 1 illustrates the overall architecture of LOQUS.

System Architecture

LOQUS consists of the following modules:
(1) An upper level ontology mapped to the domain-specific LOD datasets.
(2) A module to identify the upper level concepts contained in the query and perform the translations to the LOD cloud datasets.
(3) A module to split the query, once mapped to LOD dataset concepts, into sub-queries corresponding to different datasets.
(4) A module to execute the queries remotely, process the results, and deliver the final result to the user.

Upper Level Ontology

The upper level ontology has been created manually by reusing concepts from SUMO (Niles and Pease 2001) and by identifying their equivalent or subsuming concepts in the LOD cloud datasets. To demonstrate, the SUMO concept of Nation can map to different concepts belonging to the datasets of the LOD cloud, such as http://dbpedia.org/ontology/Country (DBpedia), http://www.geonames.org/ontology#A.PCLI (Geonames), and http://data.linkedmdb.org/resource/movie/country (LinkedMDB). These mappings are at the schema level, and thus complement the existing mappings at the instance level provided by the LOD cloud. Thus, reusing SUMO provides a single point of reference for querying the LOD cloud and consequently helps in query formulation. Further, because the mappings are at the schema level, the ontology can be utilized for reasoning and knowledge discovery over LOD cloud datasets.

Mapping of Upper Level Concepts to LOD Datasets

Using the mappings from SUMO, the concepts specified in the query can be mapped to concepts of the LOD cloud datasets. The concepts from the LOD cloud datasets are substituted into the basic graph pattern of the SPARQL query (in lieu of the concepts from SUMO) to create a query containing only concepts from the LOD datasets.
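The substitution step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LOQUS implementation: the tuple-based triple representation, the `expand_pattern` helper, and the `sumo:Nation` identifier are hypothetical, and only the three target URIs are taken from the text.

```python
from itertools import product

# Hypothetical identifier for the SUMO concept; the text does not give its URI.
SUMO_NATION = "sumo:Nation"

# Schema-level mappings taken from the Nation example in the text.
MAPPINGS = {
    SUMO_NATION: [
        "http://dbpedia.org/ontology/Country",
        "http://www.geonames.org/ontology#A.PCLI",
        "http://data.linkedmdb.org/resource/movie/country",
    ],
}


def expand_pattern(pattern, mappings):
    """Substitute upper-level concepts in a basic graph pattern.

    `pattern` is a list of (subject, predicate, object) string triples.
    Returns one rewritten pattern per combination of mapped LOD concepts.
    """
    # Upper-level terms that actually occur in the pattern, in a stable order.
    terms = sorted({t for triple in pattern for t in triple if t in mappings})
    rewritten = []
    for choice in product(*(mappings[t] for t in terms)):
        substitution = dict(zip(terms, choice))
        rewritten.append(
            [tuple(substitution.get(t, t) for t in triple) for triple in pattern]
        )
    return rewritten


# A query pattern written against the upper-level ontology (assumed shape).
query = [("?c", "rdf:type", SUMO_NATION), ("?c", "rdfs:label", "?name")]
variants = expand_pattern(query, MAPPINGS)
for variant in variants:
    print(variant[0][2])  # the LOD concept that replaced sumo:Nation
```

Each variant contains only dataset-level concepts, so a concept with several mappings yields several candidate patterns, one per target dataset.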
The presence or absence of multiple mappings for a given concept indicates whether the corresponding sub-queries (which are created in the next step) should be combined in a union or joined to each other. Hence, this step also helps in creating a query plan for the execution and processing of the results of the sub-queries.

Splitting of the Query Graph to Create Sub-Queries

The SPARQL query containing the concepts from the LOD cloud datasets is partitioned into sub-queries corresponding ...
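As a rough illustration of the planning rule and the per-dataset split, consider the Python sketch below. The available text breaks off before it explains how the partitioning is performed, so the namespace-based grouping, the function names, and the dataset labels are all assumptions made for this example. The union/join rule is one possible reading of the sentence on multiple mappings.

```python
from collections import defaultdict

# Namespace prefixes of three datasets named in the text; the short labels
# are chosen for this sketch only.
DATASET_PREFIXES = {
    "http://dbpedia.org/": "dbpedia",
    "http://www.geonames.org/": "geonames",
    "http://data.linkedmdb.org/": "linkedmdb",
}


def dataset_of(triple):
    """Return the label of the first dataset whose namespace occurs in the triple."""
    for term in triple:
        for prefix, label in DATASET_PREFIXES.items():
            if term.startswith(prefix):
                return label
    return None  # only variables, literals, or unknown namespaces


def split_by_dataset(pattern):
    """Group triples into one sub-query per dataset (assumed strategy)."""
    subqueries = defaultdict(list)
    for triple in pattern:
        subqueries[dataset_of(triple)].append(triple)
    return dict(subqueries)


def combine_operator(mapped_concepts):
    """Assumed rule: several mappings for one concept -> UNION, otherwise JOIN."""
    return "UNION" if len(mapped_concepts) > 1 else "JOIN"


pattern = [
    ("?c", "rdf:type", "http://dbpedia.org/ontology/Country"),
    ("?p", "rdf:type", "http://www.geonames.org/ontology#A.PCLI"),
]
parts = split_by_dataset(pattern)
print(sorted(parts))
print(combine_operator(["dbpedia:Country", "geonames:A.PCLI"]))
```

Grouping by namespace keeps the sketch independent of any endpoint. A real system would also need to decide where triples that mention no dataset-specific term should be sent, which this example simply files under `None`.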

Similar resources

Developing a BIM-based Spatial Ontology for Semantic Querying of 3D Property Information

With the growing dominance of complex and multi-level urban structures, current cadastral systems, which are often developed based on 2D representations, are not capable of providing unambiguous spatial information about urban properties. Therefore, the concept of 3D cadastre is proposed to support 3D digital representation of land and properties and facilitate the communication of legal owners...

Evaluation of SPARQL query generation from natural language questions

SPARQL queries have become the standard for querying linked open data knowledge bases, but SPARQL query construction can be challenging and time-consuming even for experts. SPARQL query generation from natural language questions is an attractive modality for interfacing with LOD. However, how to evaluate SPARQL query generation from natural language questions is a mostly open research question. ...

Alignment-Based Querying of Linked Open Data

The Linked Open Data (LOD) cloud is rapidly becoming the largest interconnected source of structured data on diverse domains. The potential of the LOD cloud is enormous, ranging from solving challenging AI issues such as open domain question answering to automated knowledge discovery. However, due to an inherent distributed nature of LOD and a growing number of ontologies and vocabularies used ...

Federated Data Management and Query Optimization for Linked Open Data

Linked Open Data provides data on the web in a machine-readable way with typed links between related entities. Means of accessing Linked Open Data include crawling, searching, and querying. Search in Linked Open Data allows for more than just keyword-based, document-oriented data retrieval. Only complex queries across different data sources can leverage the full potential of Linked Open Data. In...

Querying over Federated SPARQL Endpoints - A State of the Art Survey

The increasing amount of Linked Data and its inherent distributed nature have, in recent years, attracted significant attention throughout the research community and amongst practitioners seeking to search data. Inspired by research results from traditional distributed databases, different approaches for managing federation over SPARQL Endpoints have been introduced. SPARQL is the standardised query l...

Publication date: 2010